| Author | |
|---|---|
| Name | Claire Descombes |
| Affiliation | Universitätsklinik für Neurochirurgie, Inselspital Bern |
| Degree | MSc Statistics and Data Science, University of Bern |
| Contact | claire.descombes@insel.ch |
The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.
When you want to load a file (e.g. a dataset), you have two options:
If your file is stored in the folder that is currently set as your working directory, you can simply write:
setwd("C:/path/to/your/folder/")
data <- read.csv("testdata.csv")
# This is convenient because you can move the entire folder around without breaking your code: as long as you set the working directory to that folder when you open the script, everything still works.
If you want some structure inside your project (e.g. datasets stored in a subfolder “datasets”), you can use relative paths, which always start from the working directory:
setwd("C:/path/to/your/folder/")
data <- read.csv("datasets/testdata.csv")
# This is safe and portable: moving the whole folder keeps the relative paths valid.
If your files are scattered across your computer, you may prefer to specify the absolute path each time:
data <- read.csv("C:/some/other/folder/testdata.csv")
# This avoids having to change the working directory, but: the code breaks if the file moves, and the script is harder to share with others (everyone has different folder structures).
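Whichever option you choose, it can help to build paths with file.path() and check them with file.exists() before reading. A minimal sketch, assuming the "datasets" subfolder layout from above (the file name "testdata.csv" is just an example):

```r
# Build the path once; file.path() joins components with the correct separator
path <- file.path("datasets", "testdata.csv")

# Check that the file exists before trying to read it
if (file.exists(path)) {
  data <- read.csv(path)
} else {
  message("File not found: ", path)
}
```

This gives a clearer error message than the cryptic "cannot open file" you would otherwise get from read.csv().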
Working directory
To tell R which folder you are working in (e.g. where your data is stored), you have several options: you can run setwd("C:/path/to/your/folder") directly in your script or console, or use one of the commands below.
# Command to display the current working directory
getwd()
# Command to manually set your working directory
setwd("C:/path/to/your/folder")
# Command to automatically set your working directory to the location of your R file
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
We will first look at how to import a CSV file into R as a data frame.
CSV stands for Comma-Separated Values. In a .csv file,
the values are stored as plain text, separated by commas. This is a
simple and widely used format for storing tabular data.
After setting your working directory or determining the path to your
CSV file, you can use the read.csv() function to import the
data. This will create a data frame, which is one of the most commonly
used structures in R for handling datasets.
💡 I recommend using data frames as the data type for your data: they are generally easier to work with than matrices, especially for beginners.
# Import a CSV file into a data frame
dataset <- read.csv("C:/path/to/your/folder/data.csv")
The function read.csv() has several useful
arguments:
read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", row.names,
stringsAsFactors, ...)
header: A logical value (TRUE/FALSE)
indicating whether the file contains the names of the variables as its
first line. If missing, the value is determined from the file format:
header is set to TRUE if the first row contains one fewer
field than the number of columns.
sep: The field separator used in the file. For
read.csv(), the default is a comma (,), which
is standard for CSV files.
row.names: Specifies the row names of the data
frame. It can be a vector of the actual row names, a single number
giving the column of the file that contains the row names, or a
character string giving the name of that column.
If a header is present and the first row has one fewer field than the
number of columns, the first column is used as row names. Otherwise,
rows are automatically numbered. Use row.names = NULL to
force default numbering.
col.names: Optional vector of column names. If not
provided, default names like “V1”, “V2”, etc., are assigned.
stringsAsFactors: TRUE/FALSE; should
character vectors be converted to factors?
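These arguments matter in practice: files exported from European locales often use a semicolon as the field separator and a comma as the decimal mark. A small sketch (the column names are made up for illustration; read.csv() accepts a text argument, passed through to read.table(), which lets us demonstrate this without an actual file):

```r
# Simulate the contents of a semicolon-separated file with decimal commas
txt <- "id;weight\n1;70,5\n2;81,2"

# Override the defaults: sep = ";" for the field separator, dec = "," for decimals
d <- read.csv(text = txt, sep = ";", dec = ",")
str(d)
## 'data.frame':    2 obs. of  2 variables:
##  $ id    : int  1 2
##  $ weight: num  70.5 81.2
```

For this common case, base R also provides the shortcut read.csv2(), which uses sep = ";" and dec = "," by default.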
Another widely used data format is the Excel file (.xlsx
or .xls). For these, you can use the readxl
package to import the data:
# Load the readxl package (after installing it)
library(readxl)
# Read the first sheet of an Excel file
dataset <- read_excel("C:/path/to/your/folder/data.xlsx")
The function read_excel() also has several useful
arguments:
read_excel(path, sheet = NULL, range = NULL,
col_names = TRUE, col_types = NULL, na = "",...
)
path: Path to the xls/xlsx file.
sheet: Sheet to read. Either a string (the name of a
sheet), or an integer (the position of the sheet). Ignored if the sheet
is specified via range. If neither argument specifies the sheet,
defaults to the first sheet.
range: A cell range to read from, as described in
cell-specification. Includes typical Excel ranges like “B3:D87”,
possibly including the sheet name like “Budget!B2:G14”, and
more.
col_names: TRUE to use the first row as
column names, FALSE to get default names, or a character
vector giving a name for each column.
col_types: Either NULL to guess all
from the spreadsheet or a character vector containing one entry per
column from these options: “skip”, “guess”, “logical”, “numeric”,
“date”, “text” or “list”.
na: Character vector of strings to interpret as
missing values. By default, readxl treats blank cells as
missing data.
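To see these arguments in action without hunting for a spreadsheet, you can use the small example files that ship with readxl (this assumes readxl is installed; readxl_example() returns the path to a bundled file):

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
path <- readxl_example("datasets.xlsx")

# List the available sheet names
excel_sheets(path)

# Read a specific sheet, restricted to a cell range
d <- read_excel(path, sheet = "mtcars", range = "A1:C5")
d
```

Here range = "A1:C5" reads the header row plus four data rows from the first three columns of the "mtcars" sheet.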
⚠️ Note: If your file is actually a CSV but mistakenly has a
.xlsx extension, you should rename it to .csv
and use read.csv() instead. Mixing up formats can lead to
import errors.
Let us now look at real data frames to learn how to access or modify
their elements. To do this, we will use multiple health data sets from
the National Health and Nutrition Examination Survey (NHANES)
from 2011-2012. The survey assessed the overall health and nutrition of
adults and children in the United States and was conducted by the
National Center for Health Statistics (NCHS). The data sets can be found
in the data_sets
folder. More details on these data sets can be found in Appendix A.
✏️ Exercise 1: import the demo, bpx,
bmx and smq data sets from the data_sets
folder into R.
tidyverse
Base R, without any additional packages, already provides many functions that are very handy for data handling. However, some contributed packages make data preparation much easier and more readable.
I’ll introduce two such packages here, before diving into concrete
data handling examples. Both are part of a larger and very powerful
collection of packages for data science called the
tidyverse, which I use for nearly all my analyses.
💡 In Appendix B, you will find a table containing useful
functions from both base R and the tidyverse that
facilitate efficient data handling.
magrittr
One of the most downloaded contributed extension packages of all
time is magrittr. It provides a very useful operator, the
forward pipe operator %>%, which passes the object on
its left as the first argument to the function on its right. This is
much easier to understand with an example.
# The easiest way to get magrittr is to install the whole tidyverse
install.packages("tidyverse")
# Once installed, a package has to be loaded to be used
library(tidyverse)
# Let's do the same operation twice: once using the pipe, once without
# No pipe:
str(c(1,2,3,4))
## num [1:4] 1 2 3 4
# With pipe:
c(1,2,3,4) %>%
str()
## num [1:4] 1 2 3 4
# Not too exciting yet, but consider a more complex case:
summary(log(sqrt(na.omit(c(1, 4, NA, 16, 25)))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5199 1.0397 0.9222 1.4421 1.6094
# With the pipe, we can rewrite this more readably:
c(1, 4, NA, 16, 25) %>%
na.omit() %>%
sqrt() %>%
log() %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5199 1.0397 0.9222 1.4421 1.6094
The pipe helps turn nested function calls into a sequence of simpler,
linear steps. This makes code easier to read, write, and debug. The pipe
becomes especially powerful when used with functions from the
dplyr package for data manipulation.
dplyr
Another helpful R package is dplyr. It is a grammar of
data manipulation, providing a consistent set of verbs that help solve
the most common data manipulation challenges.
Let’s illustrate this with a simple example. Our goal: Group the cars dataset (contained in base R) by speed groups (e.g. low/medium/high), and for each group, compute (1) the average stopping distance and (2) the number of observations.
# Base R (no dplyr, no pipe)
cars$speed_group <- cut(cars$speed, breaks = c(0, 10, 20, 30),
labels = c("Low", "Medium", "High"))
avg_dist <- aggregate(dist ~ speed_group, data = cars, mean)
n_obs <- aggregate(dist ~ speed_group, data = cars, length)
names(n_obs)[2] <- "n"
summary_df <- merge(avg_dist, n_obs, by = "speed_group")
summary_df
# With dplyr, no pipe:
cars <- mutate(cars, speed_group = cut(speed, breaks = c(0, 10, 20, 30), labels = c("Low", "Medium", "High")))
summary_df <- summarise(group_by(cars, speed_group),
avg_dist = mean(dist),
n = n())
summary_df
# With dplyr and the pipe
cars %>%
mutate(speed_group = cut(speed, breaks = c(0, 10, 20, 30),
labels = c("Low", "Medium","High"))) %>%
group_by(speed_group) %>%
summarise(
avg_dist = mean(dist),
n = n()
)
💡 cut(x, ...) divides the range of x into
intervals (the breaks) and codes the values in x according
to which interval they fall. labels are the levels of the
resulting category. If labels = FALSE, simple integer codes
are returned instead of a factor.
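A quick self-contained illustration of cut() (the values and breaks are made up for this example):

```r
x <- c(4, 12, 18, 25)

# With labels: a factor whose levels name the intervals
cut(x, breaks = c(0, 10, 20, 30), labels = c("Low", "Medium", "High"))
## [1] Low    Medium Medium High
## Levels: Low Medium High

# With labels = FALSE: plain integer codes for the intervals
cut(x, breaks = c(0, 10, 20, 30), labels = FALSE)
## [1] 1 2 2 3
```

Note that by default the intervals are right-closed, i.e. (0,10], (10,20], (20,30], so a value of exactly 10 falls into "Low".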
As you can see, using dplyr and the pipe can make your
life much easier.
In the following chapter, we’ll use both base R and
tidyverse functions without always noting which package
they belong to. If you’re ever unsure, you can check the top-left corner
of the function’s help page.
Being able to access elements in a data frame is essential when working with data. Here are some common methods to select specific elements, rows, or columns.
# Look at the first and last few rows, respectively
head(demo)
tail(demo)
# Select columns by name
demo[, c("RIDAGEYR", "RIAGENDR")] # Selecting age in years and gender
vars <- c("RIDAGEYR", "RIAGENDR")
demo[, vars] # Alternative using variable `vars`
# Select elements by position
demo[1, 1] # Access the first element of the first column (the respondent sequence number of the 1st participant)
## [1] 62161
ind_mat <- cbind(c(1, 3, 5), c(2, 4, 6))
demo[ind_mat] # Access the elements at the (row, column) positions (1,2), (3,4) and (5,6)
## [1] "NHANES 2011-2012 public release" "Male"
## [3] NA
# Select rows based on a condition
head(demo[, "RIDAGEYR"] > 50) # Logical condition for age greater than 50
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(!(demo[, "DMDHHSIZ"] > 3)) # Logical negation for total number of people in the household not greater than 3
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
demo[demo[, "RIDAGEYR"] > 50, ] # Rows where age > 50
demo[demo[, "DMDHHSIZ"] < 3, ] # Rows where total number of people in the household is less than 3
demo[demo[, "DMDHHSIZ"] >= 3, ] # Rows where total number of people in the household is greater than or equal to 3
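The same row-and-column selections can be written more readably with base R's subset() or with the dplyr verbs filter() and select(). A sketch using a tiny stand-in data frame (the values are hypothetical; with the real data you would apply this directly to demo):

```r
library(dplyr)

# Tiny stand-in for the NHANES demo data (hypothetical values, for illustration)
demo_mini <- data.frame(
  RIDAGEYR = c(62, 34, 71, 45),
  RIAGENDR = c("Male", "Female", "Female", "Male")
)

# Base R: subset() combines the row condition and the column selection
subset(demo_mini, RIDAGEYR > 50, select = c(RIDAGEYR, RIAGENDR))

# dplyr: the same operation as a pipe of verbs
demo_mini %>%
  filter(RIDAGEYR > 50) %>%
  select(RIDAGEYR, RIAGENDR)
```

Both calls return the rows with age greater than 50, restricted to the two named columns; the dplyr version reads top to bottom as a sequence of steps.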